Extracting Protein Names from Biological Literature

Author

  • Huang-Cheng Kuo
Abstract

Named entity recognition is an essential task in extracting biological knowledge. In a biological corpus, protein names and other terminology are mixed into natural language sentences. Whether an abbreviation is a protein name sometimes depends on the context. Protein names are often composed of gene names, cell names, or even drug names. Moreover, the number of newly coined protein names keeps increasing. Even with the assistance of a dictionary, it is still hard to identify all protein names in a biological corpus automatically and correctly. We modify a hierarchical model of protein name tokens. On the one hand, we choose a rule-based method to improve the accuracy of protein name recognition. On the other hand, we use an N-gram language model to determine the boundaries of protein names. Numerous studies have noted that the hardest part is identifying abbreviations and words beginning with an uppercase letter. To enhance recognition performance, we use a dictionary to strengthen recognition of abbreviations and words beginning with an uppercase letter. Experimental results show an increase of about 10% in performance. We use the YAPEX and GENIA corpora for the experiments. In our study, the method achieves an F-score of 0.697 on the YAPEX corpus and 0.691 on the GENIA corpus. Finally, when the UniProt dictionary database is used to strengthen abbreviation recognition, the F-score reaches 0.797 on the YAPEX corpus and 0.806 on the GENIA corpus.
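The dictionary step and the F-score evaluation described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the tiny dictionary, the abbreviation pattern, and the function names (is_candidate, tag_tokens, f_score) are assumptions made for illustration, and the rule-based and N-gram boundary components are not modeled here.

```python
import re

# Hypothetical stand-in for a protein-name dictionary such as UniProt;
# these few entries are illustrative only.
PROTEIN_DICTIONARY = {"p53", "IL-2", "NF-kappaB", "Insulin"}

# Assumed pattern for short abbreviations: a few letters followed by digits,
# e.g. "p53" or "IL-2". The abstract does not specify the actual rules.
ABBREVIATION = re.compile(r"[A-Za-z]{1,5}[-/]?\d+[A-Za-z]?")


def is_candidate(token):
    """Flag abbreviations and words beginning with an uppercase letter."""
    return token[:1].isupper() or bool(ABBREVIATION.fullmatch(token))


def tag_tokens(tokens, dictionary=PROTEIN_DICTIONARY):
    """Keep only candidate tokens that the dictionary confirms."""
    return [tok for tok in tokens if is_candidate(tok) and tok in dictionary]


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over two sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = "The p53 protein binds IL-2 and Insulin in the cell".split()
    print(tag_tokens(tokens))  # ['p53', 'IL-2', 'Insulin']

    # Spans are (start, end) token offsets; two of three predictions match.
    predicted_spans = {(1, 2), (4, 5), (6, 7)}
    gold_spans = {(1, 2), (4, 5), (9, 10)}
    print(round(f_score(predicted_spans, gold_spans), 3))  # 0.667
```

In this sketch "The" is flagged as a candidate because it begins with an uppercase letter, but it is dropped because the dictionary does not confirm it. That filtering is the role the abstract assigns to the dictionary for abbreviations and capitalized words.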


Similar resources

NLProt: extracting protein names and sequences from papers

Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagge...


Extracting synonymous gene and protein terms from biological literature

MOTIVATION Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. RESULTS We have explored four com...


Bayesian inference of protein-protein interactions from biological literature

MOTIVATION Protein-protein interaction (PPI) extraction from published biological articles has attracted much attention because of the importance of protein interactions in biological processes. Despite significant progress, mining PPIs from the literature still relies heavily on time- and resource-consuming manual annotation. RESULTS In this study, we developed a novel methodology based on Bayes...


Mining Physical Protein-protein Interactions from Literature

Background: Physical protein-protein interactions are fundamental to understanding both the functions of proteins and entire biological processes. Owing to the development of high-throughput experimental technologies such as yeast two-hybrid screening, interaction data are growing at an increasing speed. Manual curation, which costs much time and money, cannot keep up with the rapid gro...


Contentment and Architecture An Investigation of the Manifestation of the Concept of Contentment in the Pattern of Iranian Traditional Houses (Case Study: Mortaz House)

The concept of contentment, derived from "the Content", one of the names of Allah, has been emphasized in Islamic foundations. Relying on these foundations, traditional man has tried to illustrate the divine names in different aspects of his life. Architecture is one of the areas in which these names appear, and among the varieties of architecture, the house provides the most possible for t...




Publication date: 2014